Optimize simple time ranged search queries by tontinton · Pull Request #5759 · quickwit-oss/quickwit

tontinton · 2025-04-19T17:56:28Z

When the search request contains a time range, we aborted the optimization of converting unneeded split searches into count queries.

Removed the TODO that was in the code for this.

Split the PR into 2: #5758.

rdettai

Thanks for your PR. The way the logic is split here between optimize_split_order and optimize makes it very hard to proofread. It's already a bit confusing today, because the optimize() logic for each variant depends on the sort order, which is different for each variant. It would be good if we could avoid tightening the coupling between the two match statements even further by making the sort orders more complex.

rdettai · 2025-04-22T09:40:07Z

        let min_required_splits = splits
            .iter()
+            // splits are sorted by whether they are contained in the request time range
+            .filter(|split| Self::is_contained(split, &request))
            .map(|split| split.num_docs)
            // computing the partial sum
            .scan(0u64, |partial_sum: &mut u64, num_docs_in_split: u64| {


Can you explain why this works? Say you have 5 splits:

1 is contained in the request

4 are only overlapping the request

In that case min_required_splits would be 1 here, but there might be a biggest_end_timestamp or smallest_start_timestamp in any of the other splits.

My thoughts were that there's a

.take_while(|partial_sum| *partial_sum < num_requested_docs)

so min_required_splits would not get to be 1, but now I see that I should validate that we actually reached this condition, and only if we reached it to set min_required_splits and optimize, otherwise return early.

I pushed a quick fix, I need to test it, but I think it answers the problem.

I didn't know of std::ops::ControlFlow 😄. Nevertheless, this reaches the limit when iterators stop being practical. A for loop is much more readable here.
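For illustration, the for-loop shape suggested above might look like the following. This is only a sketch under stated assumptions: `Split` and its `contained` flag are hypothetical stand-ins for `SplitIdAndFooterOffsets` and the result of the time range containment check, and the splits are assumed to be sorted with contained splits first, as in the diff under discussion at this point. The early `None` covers the case raised earlier, where the contained splits never reach the requested number of docs.

```rust
/// Hypothetical stand-in for the split metadata; not the real Quickwit type.
struct Split {
    num_docs: u64,
    /// Assumed to hold the result of the "contained in the time range" check.
    contained: bool,
}

/// Returns how many contained splits are needed to collect
/// `num_requested_docs`, or `None` if they never reach that count
/// (in which case the caller should skip the optimization).
fn min_required_splits(splits: &[Split], num_requested_docs: u64) -> Option<usize> {
    let mut partial_sum = 0u64;
    let mut required = 0usize;
    for split in splits.iter().filter(|split| split.contained) {
        partial_sum += split.num_docs;
        required += 1;
        if partial_sum >= num_requested_docs {
            return Some(required);
        }
    }
    // The threshold was never reached: no safe optimization.
    None
}
```

With one contained split of 3 docs, four overlapping splits, and 10 requested docs, this returns `None` instead of a misleading count of 1.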

tontinton · 2025-04-22T20:32:33Z

Thanks for your PR. The way the logic is split here between optimize_split_order and optimize makes it very hard to proofread. It's already a bit confusing today, because the optimize() logic for each variant depends on the sort order, which is different for each variant. It would be good if we could avoid tightening the coupling between the two match statements even further by making the sort orders more complex.

It is a bit confusing, I'll try to think of a better way to do this when I have some more time.

rdettai · 2025-04-23T10:13:22Z

I would try refactoring this into:

fn optimize() {
  match self {
    CanSplitDoBetter::SplitIdHigher(_) => optimize_split_id_higher(),
    CanSplitDoBetter::SplitTimestampLower(_) => optimize_split_timestamp_lower(),
    ...
    CanSplitDoBetter::Uninformative => {}
  }
}

where each optimize_split_xxx:

- sorts
- returns early if !is_simple_all_query
- applies its variant-specific optimization

This would regroup the optimization logic with the sorts it depends on to be correct. (To help the review, ideally, re-implement the current logic in 1 commit, and add your logic extension in a separate commit.)

Thanks!
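A rough, self-contained sketch of that shape is given below. Everything in it is a simplified assumption rather than the actual Quickwit code: `Split` and `Request` are stand-ins, the enum variants carry no payloads, and "convert to a count query" is modelled as setting `max_hits` to 0. The point is only that each `optimize_split_xxx` function owns its sort, its early return, and its optimization, so `optimize` matches on the variant once.

```rust
use std::cmp::Reverse;

// Simplified stand-ins (assumptions), not the real Quickwit types.
#[derive(Clone, Debug)]
struct Split {
    split_id: u32,
    timestamp_start: i64,
    num_docs: u64,
}

#[derive(Clone, Debug)]
struct Request {
    max_hits: u64,
    is_simple_all_query: bool,
}

// The real enum variants carry payloads; they are omitted here.
enum CanSplitDoBetter {
    SplitIdHigher,
    SplitTimestampLower,
    Uninformative,
}

impl CanSplitDoBetter {
    /// Single match: each arm delegates to a function that owns both the
    /// sort and the optimization relying on that sort.
    fn optimize(&self, request: &Request, splits: Vec<Split>) -> Vec<(Split, Request)> {
        match self {
            CanSplitDoBetter::SplitIdHigher => optimize_split_id_higher(request, splits),
            CanSplitDoBetter::SplitTimestampLower => {
                optimize_split_timestamp_lower(request, splits)
            }
            CanSplitDoBetter::Uninformative => with_request(request, splits),
        }
    }
}

/// Pairs every split with its own copy of the request.
fn with_request(request: &Request, splits: Vec<Split>) -> Vec<(Split, Request)> {
    splits.into_iter().map(|split| (split, request.clone())).collect()
}

/// Turns every request after the first splits holding enough docs into a
/// count-only request (modelled here as `max_hits = 0`).
fn count_only_after_enough_docs(splits: &mut [(Split, Request)], num_requested_docs: u64) {
    let mut partial_sum = 0u64;
    let mut required = splits.len();
    for (idx, (split, _)) in splits.iter().enumerate() {
        partial_sum += split.num_docs;
        if partial_sum >= num_requested_docs {
            required = idx + 1;
            break;
        }
    }
    for (_, request) in splits.iter_mut().skip(required) {
        request.max_hits = 0;
    }
}

fn optimize_split_id_higher(request: &Request, mut splits: Vec<Split>) -> Vec<(Split, Request)> {
    // 1. sort (highest split id first)
    splits.sort_unstable_by_key(|split| Reverse(split.split_id));
    let mut out = with_request(request, splits);
    // 2. return early
    if !request.is_simple_all_query {
        return out;
    }
    // 3. variant-specific optimization (placeholder)
    count_only_after_enough_docs(&mut out, request.max_hits);
    out
}

fn optimize_split_timestamp_lower(
    request: &Request,
    mut splits: Vec<Split>,
) -> Vec<(Split, Request)> {
    // 1. sort (lowest start timestamp first)
    splits.sort_unstable_by_key(|split| split.timestamp_start);
    let mut out = with_request(request, splits);
    // 2. return early
    if !request.is_simple_all_query {
        return out;
    }
    // 3. variant-specific optimization (placeholder)
    count_only_after_enough_docs(&mut out, request.max_hits);
    out
}
```

The two variant functions are deliberately repetitive here; the duplication is the price for keeping each sort next to the logic that relies on it.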

rdettai

Thanks! This is a bit more verbose, but I find it easier to read!

Check the comment marked "MORE IMPORTANT" first 😉

rdettai · 2025-04-30T08:05:46Z

+            if partial_sum >= num_requested_docs {
+                return Some(min_required_splits + 1);
            }
-            CanSplitDoBetter::SplitTimestampLower(_) => {
-                splits.sort_unstable_by_key(|split| split.timestamp_start())
+
+            min_required_splits += 1;


nit, this is more readable

min_required_splits += 1;
if partial_sum >= num_requested_docs {
    return Some(min_required_splits);
}

rdettai · 2025-04-30T08:07:05Z

+    fn get_min_required_splits(
+        splits: &[SplitIdAndFooterOffsets],
+        request: &SearchRequest,
+    ) -> Option<usize> {


nice, this is now really straightforward to understand and verify 😄

rdettai · 2025-04-30T08:20:50Z

+            let contained = Self::is_split_contained_in_search_time_range(split, &request);
+            (!contained, std::cmp::Reverse(split.timestamp_end()))


It is really hard to know here if contained splits come first or second (the ordering of booleans is not obvious, and the ! negation doesn't help 😄 ). Could we maybe explicitly map true/false to something that is more clearly ordered:

let contained_first = match Self::is_split_contained_in_search_time_range(split, &request) {
    true => 0u8,
    false => 1u8,
};
(contained_first, std::cmp::Reverse(split.timestamp_end()))

MORE IMPORTANT: now that I think about it, I think it's not a good idea to have contained splits processed first in the general case (!is_simple_all_query or get_min_required_splits().is_none()). You miss an optimization opportunity, because once you have processed them, you will still need to check the splits overlapping the higher end of the range for better matches.
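On the readability point, the two sort keys can be compared in isolation. The sketch below uses hypothetical helper names and plain `(contained, timestamp_end)` pairs instead of real splits. It relies on the fact that Rust orders `false` before `true`, so `!contained` sorts contained splits first, exactly like the explicit `0u8`/`1u8` mapping.

```rust
use std::cmp::Reverse;

/// Key from the diff: relies on `false < true`, so `!contained` puts
/// contained splits first.
fn sort_key_negated(contained: bool, timestamp_end: i64) -> (bool, Reverse<i64>) {
    (!contained, Reverse(timestamp_end))
}

/// Key from the review suggestion: the intended order is spelled out.
fn sort_key_explicit(contained: bool, timestamp_end: i64) -> (u8, Reverse<i64>) {
    let contained_first = match contained {
        true => 0u8,
        false => 1u8,
    };
    (contained_first, Reverse(timestamp_end))
}

/// Sorts `(contained, timestamp_end)` pairs with each key and returns both
/// results (negated key first, explicit key second).
fn sorted_with_both_keys(splits: Vec<(bool, i64)>) -> (Vec<(bool, i64)>, Vec<(bool, i64)>) {
    let mut by_negated = splits.clone();
    by_negated.sort_by_key(|&(contained, end)| sort_key_negated(contained, end));
    let mut by_explicit = splits;
    by_explicit.sort_by_key(|&(contained, end)| sort_key_explicit(contained, end));
    (by_negated, by_explicit)
}
```

Both keys yield the same order (contained splits first, then descending end timestamp); the explicit mapping only makes that order visible to the reader.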

hmm, we can maybe instead of sorting by contained, keep the same sort mechanism as it was.

but still keep the new implementation of get_min_required_splits, meaning we won't optimize the first few splits which might not be contained in the search query's range.

made the change to not sort, tell me what you think.

rdettai · 2025-04-30T08:29:10Z

+        splits
+            .into_iter()
+            .map(|split| (split, (*request).clone()))
+            .collect::<Vec<_>>()


keep the comment:

// TODO: we maybe want here some deduplication + Cow logic

rdettai · 2025-04-30T08:50:01Z

+    fn optimize_split_id_higher(
+        &self,
+        request: Arc<SearchRequest>,
+        mut splits: Vec<SplitIdAndFooterOffsets>,
+    ) -> Result<Vec<(SplitIdAndFooterOffsets, SearchRequest)>, SearchError> {


we could actually call to_splits_with_request before (inside optimize) and simplify this signature to something like:

Suggested change

-    fn optimize_split_id_higher(
-        &self,
-        request: Arc<SearchRequest>,
-        mut splits: Vec<SplitIdAndFooterOffsets>,
-    ) -> Result<Vec<(SplitIdAndFooterOffsets, SearchRequest)>, SearchError> {
+    fn optimize_split_id_higher(
+        &mut [(SplitIdAndFooterOffsets, SearchRequest)]
+    ) -> Result<(), SearchError> {

This would prevent quite a lot of repetition and is also more explicit, because it shows that we don't drop splits (note that &self is also dropped because it is not used).
(applies to all optimize_split_xxx)

I modified to remove &self, tried what you suggested, but then simple stuff like calling is_simple_all_query that accepts a &SearchRequest became ugly, WDYT?

trinity-1686a · 2025-05-15T09:06:32Z

+        // In this case there is no sort order, we order by split id.
+        // If the the first split has enough documents, we can convert the other queries to
+        // count only queries.
+        for (_split, request) in split_with_req.iter_mut().skip(min_required_splits) {


I think this is now wrong due to get_min_required_splits skipping over some splits.
Imagine a 2-split scenario: split A contains 2 documents, one in the time range and one not; split B contains 10 docs, all in range; we want 10 docs. A isn't fully contained in the time range, so it's ignored; B is in range, so it is counted, and we get that one split is enough to get the docs we want. Now here B gets disabled, because it isn't the 1st split, so only a single doc from A gets returned.

oh nice catch, that's an easy fix, just min_required_splits += 1 when !Self::is_split_contained_in_search_time_range(split, request)

pushed: https://github.com/quickwit-oss/quickwit/compare/110b53a2910425c8a9f8886d1e51b362794fd2e7..7904dc29201a79633f4427d22bb377d1012fb9a1
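To make the two-split scenario concrete, here is a small sketch contrasting the two counting strategies. It is an illustration built on assumptions, not the pushed code: `Split` and its `contained` flag are hypothetical stand-ins, and the fixed variant simply follows the description above (a split that is not contained still advances the count, but contributes no docs to the partial sum).

```rust
/// Hypothetical stand-in for the split metadata; not the real Quickwit type.
struct Split {
    num_docs: u64,
    /// Assumed result of the "contained in the time range" check.
    contained: bool,
}

/// Counts only contained splits, so the result is not a position in the
/// sorted list. Combined with `.skip(min_required_splits)` this can disable
/// a split that is actually needed.
fn min_required_splits_buggy(splits: &[Split], num_requested_docs: u64) -> Option<usize> {
    let mut partial_sum = 0u64;
    let mut required = 0usize;
    for split in splits.iter().filter(|split| split.contained) {
        partial_sum += split.num_docs;
        required += 1;
        if partial_sum >= num_requested_docs {
            return Some(required);
        }
    }
    None
}

/// Counts every split that is walked over, so the result is the length of
/// the prefix that must keep its full search request.
fn min_required_splits_fixed(splits: &[Split], num_requested_docs: u64) -> Option<usize> {
    let mut partial_sum = 0u64;
    let mut required = 0usize;
    for split in splits {
        required += 1;
        if !split.contained {
            // Not contained: its doc count cannot be trusted for the time
            // range, but it still occupies a slot in the prefix.
            continue;
        }
        partial_sum += split.num_docs;
        if partial_sum >= num_requested_docs {
            return Some(required);
        }
    }
    None
}
```

For split A (2 docs, not contained) followed by split B (10 docs, contained) and 10 requested docs, the first function returns `Some(1)`, which makes `.skip(1)` turn B into a count-only query, while the second returns `Some(2)` and leaves both splits searched.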

tontinton mentioned this pull request Apr 19, 2025

Get count from split metadata on simple time range query #5758

Open

tontinton changed the title ~~Optimize time ranged search queries~~ Optimize simple time ranged search queries Apr 19, 2025

tontinton force-pushed the optimize-timestamp-range-simple-search branch 2 times, most recently from 804be29 to 61c3b74 Compare April 19, 2025 19:51

rdettai reviewed Apr 22, 2025

tontinton force-pushed the optimize-timestamp-range-simple-search branch from 61c3b74 to 851c15c Compare April 22, 2025 20:31

tontinton force-pushed the optimize-timestamp-range-simple-search branch 2 times, most recently from 103b8d9 to 69e85ff Compare April 22, 2025 20:44

tontinton force-pushed the optimize-timestamp-range-simple-search branch 3 times, most recently from ed6cd7b to 3055965 Compare April 29, 2025 18:33

tontinton requested a review from rdettai April 29, 2025 18:36

tontinton force-pushed the optimize-timestamp-range-simple-search branch 3 times, most recently from cb52ca6 to 6b427b5 Compare April 29, 2025 18:43

rdettai requested changes Apr 30, 2025

tontinton force-pushed the optimize-timestamp-range-simple-search branch 2 times, most recently from 4758505 to b7ea08e Compare April 30, 2025 20:55

Refactor leaf optimization so we match on CanSplitDoBetter only once

f469e53

tontinton force-pushed the optimize-timestamp-range-simple-search branch 2 times, most recently from f066b6c to 110b53a Compare April 30, 2025 21:03

tontinton requested a review from rdettai April 30, 2025 21:04

tontinton mentioned this pull request May 14, 2025

Remove timerange root search #5760

Open

trinity-1686a reviewed May 15, 2025

Optimize time ranged leaf search queries

7904dc2

When the search request contains a time range, we aborted the optimization of converting unneeded split searches into count queries.

tontinton force-pushed the optimize-timestamp-range-simple-search branch from 110b53a to 7904dc2 Compare May 15, 2025 13:47

tontinton requested a review from trinity-1686a May 15, 2025 13:48
